feat: make mesh accept meshcontext #2266
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/ok to test 3dcadfb
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/ok to test a8b2df6
/ok to test 4a4ba1a
…ntrol (#2444) * feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training Add --reasoning {none,save,disable} flag to regenerate.py for controlling whether target model reasoning content is preserved or suppressed during data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash training recipes to exclude reasoning traces from the loss mask. Co-authored-by: khazic <khazzz1c@gmail.com> Signed-off-by: thyways <2484113689@qq.com> Signed-off-by: khazic <khazzz1c@gmail.com> * feat(speculative): add EAGLE-3 sequence packing for draft training Pack variable-length chat samples into fixed-width rows for EAGLE-3 training, removing the per-sample padding waste of the default max_length path. Documents within a row attend block-causally: the target uses a 4D block-causal mask (SDPA) and the draft uses varlen FlashAttention-2; cross-document TTT supervision is gated by doc_remaining so deeper steps never leak across boundaries. Opt-in via packed_sequence_size > 0, colocated target backend only. Covered by unit tests plus an FA2-vs-eager parity test. Co-authored-by: khazic <khazzz1c@gmail.com> Signed-off-by: thyways <2484113689@qq.com> Signed-off-by: khazic <khazzz1c@gmail.com> --------- Signed-off-by: thyways <2484113689@qq.com> Signed-off-by: khazic <khazzz1c@gmail.com> Co-authored-by: thyways <2484113689@qq.com> Co-authored-by: Huiying <willwin.lee@gmail.com>
jgerh left a comment
Completed tech pubs review of docs/guides/gradient-checkpointing.md and provided a few suggestions
…2389) * feat(distributed): add selective activation checkpointing for FSDP2 Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * fix(distributed): support selective activation checkpointing with torch.compile Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs(fern): drop selective AC from frozen v0.4 snapshot Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * feat(distributed): honor selective activation checkpointing on single GPU Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * feat(moe): support selective activation checkpointing with expert parallelism Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * fix(model): make DeepSeek MLP dispatch wrapper-safe Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * fix(distributed): save expert grouped-GEMM in selective AC and add op trace Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * feat(moe): compile selective activation checkpointing wrappers outer Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * refactor(distributed): move selective AC into its own module Extract the TorchTitan-style selective activation checkpointing core out of the central parallelizer.py into a dedicated activation_checkpointing.py: op-set construction, the save/recompute policy, block/sub-module wrappers, KV-sharing detection, and the compile-outer wrapper flag. parallelizer.py keeps only the thin apply_selective_activation_checkpointing entry point, which still needs the heavy, transformers-aware _extract_model_layers, so the dependency stays one-directional (parallelizer -> activation_checkpointing -> parallelizer_utils) with no circular imports. Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a single call site instead of trace globals plus a helper. Make the new module's cross-module interface public (drop the leading underscore) and keep internal op-resolution/plumbing private. 
Update the moe and fsdp2 consumers and the unit tests to import from the new module. Also fix doc wording: clarify that torch.compile must be held fixed when comparing full vs. selective, and refer to TorchTitan as a reference implementation rather than "upstream". Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * refactor(distributed): move selective-AC trace into the AC module Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs: apply tech-writer edits to gradient-checkpointing guide Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> --------- Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
* ci: add nemo-run, split qwen-vl-utils from decord for arm Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: override in pytorch container Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Update uv lock Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> --------- Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
…2419) * fix(transformers): unify loaded HF dtype via promote_types Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to the checkpoint dtype, unify each floating tensor to promote_types(checkpoint, requested). This honors an explicit fp32 request while preserving intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on HF mixed-dtype loads. Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * feat(distributed): default pipeline dtype to FSDP activation dtype When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to param_dtype) so pipeline stage shape inference matches the real activation dtype (e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is honored but warned on mismatch, since it can corrupt inter-stage recv buffers. No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1. Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit 3f6b246) * refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage fully_shard_by_dtype now groups parameters by their required *compute* dtype instead of their storage dtype, so fp32 master weights (uniform fp32 storage) still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32 params keep fp32 compute. Per-parameter compute dtype is resolved by precedence: pinned fp32 (_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype. Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the NemotronH and Qwen3.5 strategies thread the declaration through. 
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit 3dd6b97) * docs(model-onboarding): document _keep_in_fp32_modules_strict contract Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/ dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare them (class attribute vs patch_hf_model instance attribute), and why the pin is the robust signal across all load paths. Broaden the MoE checklist item and code comment accordingly. Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit a11db38) * test(distributed): add fp32 compute-dtype contract test Assert the resident compute dtype of every trainable parameter across the model archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype, under fp32 master weights and ordinary loads. Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit dc83926) * feat(model): cast frozen modules to compute dtype to avoid mismatch Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit d321f5e) * refactor(gemma4): drop projector dtype hook now general frozen cast handles it Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> (cherry picked from commit 1bc67e2) * feat(training): add dormant resolve_storage_dtype helper Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype to fp32 for full-parameter torch.optim training. Not yet wired into recipes here; the call sites are marked with breadcrumb comments and enabled in a follow-up PR, keeping this PR limited to dtype bug fixes with no behavior/memory change. 
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * fix(model): cast frozen-module buffers and unsharded params to compute dtype Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs(infra): correct frozen-tower FSDP comment to match sharding reality Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs(mixed-precision): apply tech writer edits Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> * docs(mixed-precision): drop unresolvable FSDP anchor Signed-off-by: Yuhe Zhang <yuhez@nvidia.com> --------- Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
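The per-parameter precedence described in this commit message (pinned fp32 > HF-recorded checkpoint dtype > mp_policy.param_dtype) can be illustrated with a small sketch. This is not the repository's implementation: the function name, the use of strings for dtypes, and matching pinned modules by dotted-name component are all assumptions made for illustration.

```python
from typing import Optional, Set


def resolve_compute_dtype(
    param_name: str,
    pinned_fp32_modules: Set[str],
    recorded_dtype: Optional[str],
    param_dtype: str,
) -> str:
    """Hypothetical helper: pick a compute dtype by precedence."""
    # 1. Pinned fp32 wins: any dotted-name component listed as pinned.
    if any(part in pinned_fp32_modules for part in param_name.split(".")):
        return "float32"
    # 2. Otherwise honor the dtype recorded from the checkpoint, if any.
    if recorded_dtype is not None:
        return recorded_dtype
    # 3. Fall back to the mixed-precision policy's param dtype.
    return param_dtype
```

Under this sketch, a parameter such as A_log stays in fp32 whenever its name is pinned, regardless of the mixed-precision policy, while unpinned parameters with no recorded dtype follow the policy.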
…2448) Add examples/speculative/README.md covering the whole speculative-decoding draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE, DFlash), target-model registry coverage, compute backends (eager vs flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy, d2t/t2d draft-vocab compression), target backends (co-located, remote, offline cache), serving and benchmarking, inference-engine compatibility, and a consolidated config reference. Fold the standalone regenerate_with_target.md into the README's data preparation section (full two-step flow, tuning table, pitfalls) and remove the separate file so there is a single entry point. Signed-off-by: khazic <khazzz1c@gmail.com>
) * feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support Signed-off-by: linnan wang <linnanw@nvidia.com> * fix the memory management for training large 14B wan model * fix wan2.2 support * all good for wan2.2 * update Signed-off-by: linnan wang <linnanw@nvidia.com> * docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry Signed-off-by: linnan wang <linnanw@nvidia.com> * fix anther round of code review * fix(diffusion): sort wan.py imports to satisfy CI isort (I001) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage inference) by loading EMA/consolidated state dicts with map_location="cpu"; load_state_dict copies into the already-on-device parameters. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Signed-off-by: linnan wang <linnanw@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Resolve conflicts between the MeshContext/DistributedSetup refactor and main's selective activation checkpointing (#2389), FSDP2 dtype fixes (#2419), and DDP find_unused_parameters:
- config.py: keep the DistributedSetup/MoEParallelizerConfig refactor and the DistributedStrategyConfig rename; fold in ActivationCheckpointingMode + a back-compat DistributedConfig alias; widen DistributedSetup.activation_checkpointing; DDPConfig gains find_unused_parameters and drops backend.
- mesh.py: MeshContext stays pure topology (strategy/pipeline/moe/AC fields removed); main's AC-type change there is moot.
- infrastructure.py: keep moe_parallel_config param + cast_frozen_modules import; drop the relocated moe.config MoEParallelizerConfig import; widen activation_checkpointing.
- ddp.py / diffusion: preserve find_unused_parameters via DDPConfig, drop backend.
- multimodal/finetune.py: fix moe_config= -> moe_parallel_config= to match the API.
- tests: align dist_utils + diffusion DDP tests with the new DistributedSetup API.
/ok to test 7a55fc1
The DDP strategy config exposes find_unused_parameters (default False), so _build_diffusion_parallel_manager_args returns it in the ddp branch. Update the test's expected dict to match, fixing the L0 unit test failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review
/ok to test 42b703f
    if moe_parallel_config is None:
        moe_parallel_config = MoEParallelizerConfig()
    parallelize_fn = partial(
        parallelize_model,
        activation_checkpointing=activation_checkpointing,
        **moe_parallel_config.to_dict(),
Bug: the old code forwarded model_wrapper.mp_policy (from FSDP2Config) to the MoE parallelizer when MoEParallelizerConfig.mp_policy was None:
    # old code
    moe_kwargs = moe_config.to_dict()
    if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
        moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
This ensured that a custom mp_policy on FSDP2Config (e.g. fp16 or custom reduce_dtype) propagated to expert sharding. The new code doesn't forward it: MoEParallelizerConfig.mp_policy defaults to None, and the MoE parallelizer falls back to its own hardcoded bf16/fp32 default.
For the default config this is identical (both default to bf16/fp32), but for users passing a custom mp_policy on FSDP2Config with EP models, the MoE sharding will silently ignore their precision choice. Consider restoring the forwarding:
- if moe_parallel_config is None:
-     moe_parallel_config = MoEParallelizerConfig()
- parallelize_fn = partial(
-     parallelize_model,
-     activation_checkpointing=activation_checkpointing,
-     **moe_parallel_config.to_dict(),
+ if moe_parallel_config is None:
+     moe_parallel_config = MoEParallelizerConfig()
+ moe_kwargs = moe_parallel_config.to_dict()
+ if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
+     moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
+ parallelize_fn = partial(
+     parallelize_model,
+     activation_checkpointing=activation_checkpointing,
+     **moe_kwargs,
+ )
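To make the fallback in this comment concrete, here is a minimal, self-contained sketch of the forwarding logic. `FakeMoEConfig` and `FakeWrapper` are hypothetical stand-ins, not the real MoEParallelizerConfig or FSDP2 model wrapper, and `build_moe_kwargs` is an illustrative name; only the `mp_policy` fallback itself mirrors the suggested change.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class FakeMoEConfig:
    """Hypothetical stand-in for MoEParallelizerConfig."""

    mp_policy: Optional[Any] = None
    reshard_after_forward: bool = False

    def to_dict(self) -> dict:
        return asdict(self)


@dataclass
class FakeWrapper:
    """Hypothetical stand-in for the FSDP2 model wrapper."""

    mp_policy: Optional[Any] = None


def build_moe_kwargs(moe_parallel_config, model_wrapper) -> dict:
    # Use a default config when the caller passes none.
    if moe_parallel_config is None:
        moe_parallel_config = FakeMoEConfig()
    moe_kwargs = moe_parallel_config.to_dict()
    # Inherit the wrapper's policy only when the MoE config leaves it unset.
    if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
        moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
    return moe_kwargs
```

An explicit `mp_policy` on the MoE config still takes priority; the wrapper's policy is only a fallback, which is the behavior the old code had.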
components/distributed/mesh.py
MeshContext -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
Stale documentation: this block describes the pre-refactor MeshContext. After this PR:
- MeshContext no longer has strategy_config, pipeline_config, or moe_config; those moved to DistributedSetup.
- STRATEGY_MAP was removed from mesh.py; it's now _STRATEGY_MAP in config.py.
- components/distributed/mesh.py
- MeshContext -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
- Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
- STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
- MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
+ components/distributed/mesh.py
+ MeshContext -- device_mesh, moe_mesh
+ Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
+ MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
| ``` | ||
| components/moe/config.py | ||
| MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc. |
Stale path: MoEParallelizerConfig was moved to components/distributed/config.py in this PR.
| ``` | |
| components/moe/config.py | |
| MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc. | |
| components/distributed/config.py | |
| MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc. | |
| components/moe/config.py | |
| MoEConfig -- n_routed_experts, n_activated_experts, score_func, etc. |
/claude review
/ok to test 300109d
Resolve conflicts from main (14 commits) against the distributed-config refactor.

Key resolutions:
- backend: keep the PR's removal of the configurable per-strategy `backend` (DDPConfig has no backend; managers don't take it; tests assert its absence). backend remains a process-group concern (dist_env / init).
- config.py / mesh.py / infrastructure.py: keep the PR's DistributedSetup/MeshContext structure and moe_parallel_config naming.
- activation checkpointing: keep the PR's design (carried on the parsed value and injected onto the strategy config later via infrastructure._with_activation_checkpointing, not in parse_distributed_section). Deduped a merge-duplicated _normalize_activation_checkpointing; updated the two selective-AC tests from main to assert the PR's behavior (AC stays off strategy_config).
- skills model-onboarding SKILL.md: take main's new "Declare model capabilities" section.
- moe/parallelizer.py: take main's _moe_shard_placement helper (it is used).
- _dist_utils.py: drop main's unused `import dataclasses`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/ok to test 994ad67
- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config) to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset, so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references: MeshContext no longer holds strategy_config/pipeline_config/moe_config and STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now lives in components/distributed/config.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
/claude review
/ok to test bbd2d61
LGTM
Clean, well-structured refactoring that consolidates distributed setup into a single DistributedSetup object. The new layering (topology in MeshContext, policies in DistributedSetup) is clear and consistent across all recipe callsites. Test coverage is thorough — all major new code paths (DistributedSetup.build(), _resolve_distributed_setup(), _reject_separate_distributed_kwargs(), the backend removal, MegatronFSDP aliases) have dedicated tests. Skill documentation is updated to match the new API. No bugs, logic errors, or typos found.
What does this PR do?
Refactors the distributed public API so topology and distributed policies are layered explicitly.
The main user-facing object is now DistributedSetup, which owns:
- mesh_context: runtime topology and DeviceMesh / MoE mesh access
- strategy_config: FSDP2 / Megatron FSDP / DDP strategy config
- pipeline_config: pipeline-parallel runtime config
- moe_parallel_config: MoE parallelization config
- activation_checkpointing: activation-checkpointing policy

MeshContext is narrowed to topology only. It no longer owns activation checkpointing or higher-level training policy.

Changelog
- DistributedSetup.build(...) as the component-layer entry point for constructing distributed setup from strategy, parallelism sizes, pipeline config, MoE config, and activation checkpointing.
- device_mesh compatibility in NeMoAutoModel*.from_pretrained by wrapping raw HF-style meshes into an internal topology-only DistributedSetup.
- device_mesh.py and move raw mesh construction/access helpers into mesh_utils.py.
- ParallelismSizes for dp/tp/pp/cp/ep sizing intent.
- MoEParallelizerConfig into distributed config, since it is part of distributed setup rather than model-only MoE config.
- DistributedSetup from YAML/programmatic config and fan out the derived runtime attributes consistently.
- device_mesh compatibility.

API shape
Python usage:
HF-compatible raw mesh usage is still allowed:
Future work
Currently FSDP2Config is not pure FSDP, but also includes options for TP/SP; those will be refactored in a follow-up PR to separate concerns.
Before your PR is "Ready for review"
Pre checks:
Validation:
- python -m ruff check ...
- python -m ruff format --check ...
- python -m py_compile ...
- pytest tests/unit_tests/recipes/test_dist_utils.py -q

Note: local full recipe test collection is blocked in my environment by an existing mlflow/cachetools.func.cached import mismatch. CI should be used for full CPU coverage.

Additional Information
This keeps the TorchTitan-like layering:
- ParallelismSizes
- MeshContext
- DistributedSetup
- create_distributed_setup_from_config
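The layering can be pictured with a small, self-contained sketch. Every class and function below is a hypothetical stand-in prefixed with `Sketch`/`sketch_`; none of them is the real NeMo Automodel API, and the rule used to infer `dp_size` is an assumption made only for this example. The field names on the setup object follow the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchParallelismSizes:
    """Sizing intent only; dp_size=None means 'infer from world size'."""

    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1
    ep_size: int = 1
    dp_size: Optional[int] = None


@dataclass(frozen=True)
class SketchMeshContext:
    """Topology only: no strategy, pipeline, MoE, or AC fields."""

    world_size: int
    sizes: SketchParallelismSizes

    @property
    def dp_size(self) -> int:
        if self.sizes.dp_size is not None:
            return self.sizes.dp_size
        # Assumed rule for this sketch: leftover ranks go to data parallelism.
        model_parallel = self.sizes.tp_size * self.sizes.pp_size * self.sizes.cp_size
        return self.world_size // model_parallel


@dataclass
class SketchDistributedSetup:
    """Policies live here, layered on top of the topology."""

    mesh_context: SketchMeshContext
    strategy_config: Any = None
    pipeline_config: Any = None
    moe_parallel_config: Any = None
    activation_checkpointing: Any = False


def sketch_setup_from_config(cfg: dict, world_size: int) -> SketchDistributedSetup:
    # Build topology first, then attach policies from the config mapping.
    sizes = SketchParallelismSizes(**cfg.get("sizes", {}))
    return SketchDistributedSetup(
        mesh_context=SketchMeshContext(world_size=world_size, sizes=sizes),
        strategy_config=cfg.get("strategy"),
        activation_checkpointing=cfg.get("activation_checkpointing", False),
    )
```

The point of the split is that the topology object can be passed around without carrying training policy, so activation checkpointing and strategy choices are read from the setup object rather than from the mesh.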